Diskless Checkpointing Diskless Checkpointing

نویسندگان

  • James S. Plank
  • Michael Puening
  • Kai Li
  • Michael A. Puening
چکیده

The precursor to this work (where diskless checkpointing was rst described) was presented at FTCS-24 27]. Abstract Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Novel Multi-core System for Fault Tolerance and Security with Diskless Checkpointing Capability

Improving fault tolerance within the clusters has become vital because of the drastic decrease in the Mean Time Between Failures (MTBF) in complex clusters. Checkpointing is one of the robust ways to improve fault tolerance by rolling back from a saved state in the event of a failure in the cluster. Since, checkpointing primarily relies on storage devices for storing the states at regular inter...

متن کامل

Efficient Diskless Checkpointing and Log Based Recovery Schemes

Checkpointing and message logging are the popular and generalpurpose tools for providing fault tolerance in distributed systems. Diskless checkpointing schemes enable frequent checkpointing without a performance penalty. The present work extends James S Plank‟s Diskless checkpointing scheme (N+1 Parity) by introducing ‘Timeout’ mechanism to checkpoint programs with high locality of reference. T...

متن کامل

Fault Tolerance and Checkpointing

Research and applications of clusters of workstations are growing rapidly. One of the major area is fault tolerance. This report describes two issues concerned: correctness and performance. After a number of techniques to improve performance are described, new research directions, diskless checkpointing and Java checkpointing, are introduced.

متن کامل

A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform

This paper discusses the issue of fault-tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM BlueGene/L, are predicted to be deployed in the next five to ten years. Since a 100,000-processor system is going to be less reliable, scientific applications need to be able to recover from occurring failures more efficient...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1997